612 research outputs found

A map of human genome variation from population-scale sequencing

Author: Li Yun
The 1000 Genomes Project Consortium
Publication date: 01/01/2010

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Carolina Digital Repository

An integrated map of genetic variation from 1,092 human genomes

Author: Li Yun
The 1000 Genomes Project Consortium
Publication date: 01/01/2012

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Carolina Digital Repository

A global reference for human genetic variation

Author: Li Yun
The 1000 Genomes Project Consortium
Publication date: 01/01/2015

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Carolina Digital Repository

HapZipper: sharing HapMap populations just got easier

Author: Ahn
Altshuler
Brandon
Burrows
Christley
Dublin
Eran Elhaik
Joel S. Bader
Levy
Pritam Chanda
Sansom
Schuster
Service
The 1000 Genomes Project Consortium
Wang
Willyard
Ziv
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2011

The rapidly growing amount of genomic sequence data being generated and made publicly available necessitates the development of new data storage and archiving methods. The vast amount of data being shared and manipulated also creates new challenges for network resources. Thus, developing advanced data compression techniques is becoming an integral part of data production and analysis. The HapMap project is one of the largest public resources of human single-nucleotide polymorphisms (SNPs), characterizing over 3 million SNPs genotyped in over 1000 individuals. The standard format and biological properties of HapMap data suggest that a dedicated genetic compression method can outperform generic compression tools. We propose a compression methodology for genetic data by introducing HapZipper, a lossless compression tool tailored to compress HapMap data beyond benchmarks defined by generic tools such as gzip, bzip2 and lzma. We demonstrate the usefulness of HapZipper by compressing HapMap 3 populations to <5% of their original sizes. HapZipper is freely downloadable from https://bitbucket.org/pchanda/hapzipper/downloads/HapZipper.tar.bz
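
The generic-tool benchmarks named in the abstract above (gzip, bzip2 and lzma) can be reproduced in outline with Python's standard library. The following is a minimal sketch, not HapZipper itself: the function name `compression_ratios` and the toy `sample` input are assumptions made for illustration, and the input is not real HapMap data.

```python
import bz2
import gzip
import lzma


def compression_ratios(data: bytes) -> dict[str, float]:
    """Return compressed size / original size for three generic compressors."""
    if not data:
        raise ValueError("data must be non-empty")
    compressed_sizes = {
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }
    return {name: size / len(data) for name, size in compressed_sizes.items()}


# Hypothetical stand-in for a genotype file: a repetitive run of allele pairs.
sample = b"AA AG GG AG AA AA GG AG " * 2000
ratios = compression_ratios(sample)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2%} of original size")
```

A dedicated tool would be compared against ratios like these on the same input; the toy input here is far more repetitive than real genotype data, so its ratios say nothing about HapMap files.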

CiteSeerX

Crossref

Lund University Publications

PubMed Central

White Rose Research Online

A genomic data viewer for iPad

Author: Douglass Turner
H Li
H Thorvaldsdóttir
Helga Thorvaldsdóttir
James T Robinson
Jill P Mesirov
JT Robinson
MN Cabili
The 1000 Genomes Project Consortium
The ENCODE Project Consortium
WJ Kent
Publication venue: Springer Science and Business Media LLC

Crossref

A bi-objective feature selection algorithm for large omics datasets

Author: Almuallim
Boros
Cavique
Cavique
Chandrashekar
Chung
Chvatal
Collette
Crama
Joncour
Kira
Liu
Pawlak
Pawlak
Peters
Polkowski
Smet
Stephens
Talbi
The 1000 Genomes Project Consortium
Yao
Publication venue: Wiley
Publication date: 01/01/2018

Special Issue: Fourth special issue on knowledge discovery and business intelligence. Feature selection is one of the most important concepts in data mining when dimensionality reduction is needed. The performance measures of feature selection encompass predictive accuracy and result comprehensibility. Consistency-based methods are a significant category of feature selection research that substantially improves the comprehensibility of the result using the parsimony principle. In this work, the bi-objective version of the algorithm Logical Analysis of Inconsistent Data is applied to large volumes of data. In order to deal with hundreds of thousands of attributes, heuristic decomposition uses parallel processing to solve a set covering problem and a cross-validation technique. The bi-objective solutions contain the number of reduced features and the accuracy. The algorithm is applied to omics datasets with genome-like characteristics of patients with rare diseases. The authors would like to thank the FCT for its support (UID/Multi/04046/2013). This work used the EGI, European Grid Infrastructure, with the support of the IBERGRID, Iberian Grid Infrastructure, and INCD (Portugal).
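
The set covering problem mentioned in the abstract above can be illustrated with the classic greedy heuristic. This is a generic textbook sketch, not the authors' decomposition or their Logical Analysis of Inconsistent Data algorithm; the function name `greedy_set_cover` and the feature names `f1` to `f4` are hypothetical.

```python
def greedy_set_cover(universe: set[int], subsets: dict[str, set[int]]) -> list[str]:
    """Pick subsets greedily until every element of `universe` is covered.

    At each step, choose the subset that covers the most still-uncovered
    elements. The result is a valid cover, not necessarily a minimum one.
    """
    uncovered = set(universe)
    chosen: list[str] = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            raise ValueError("the given subsets cannot cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy instance: elements stand for items to be covered, and each
# hypothetical feature covers some of them.
universe = {1, 2, 3, 4, 5}
subsets = {"f1": {1, 2, 3}, "f2": {2, 4}, "f3": {4, 5}, "f4": {3}}
chosen = greedy_set_cover(universe, subsets)
print(chosen)  # ['f1', 'f3']
```

In a feature-selection setting, the chosen subset names would correspond to the retained features, so a smaller cover means a more parsimonious result.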

Crossref

Repositório Aberto da Universidade Aberta

Repositório Científico do Instituto Nacional de Saúde

Sequence data of six unusual alleles at SE33 and D1S1656 STR Loci

Author: The 1000 Genomes Project Consortium
Alsafiah
Borsuk
Gettings
Gettings
Gill
Gill
Guo
Hares
Kline
Moller
Novroski
Parson
Rolf
Ruitberg
Wang
Wiegand
Publication venue: Wiley
Publication date: 01/10/2018

When profiling a reference dataset of 500 DNA samples for the population of Saudi Arabia using the GlobalFiler® PCR amplification kit, six unusual alleles were detected. At the SE33 locus, four novel alleles were found: 2, 14.3, 20.3, and 38. Two alleles at the D1S1656 locus, 7 and 8, had been previously reported, but no published sequence data were available. The D1S1656 alleles were sequenced using ForenSeq™ DNA Signature Prep with the MiSeq FGx System (Illumina, USA). As SE33 is not reported by available Massively Parallel Sequencing (MPS) systems, samples that exhibited the unreported alleles were sequenced using the BigDye™ Terminator v3.1 Cycle Sequencing Kit. Here we present the sequence and structure of the previously uncharacterized alleles.

CLoK

Crossref

Evaluation of variant detection software for pooled next-generation sequence data

Author: 1000 Genomes Project Analysis Group
A Grada
A McKenna
A Wilm
DC Koboldt
H Li
H Li
Howard W. Huang
J McClellan
James C. Mullikin
JK Teer
LG Biesecker
MA DePristo
Nancy F. Hansen
The 1000 Genomes Project Consortium
V Bansal
Z Wei
Publication venue: Springer Science and Business Media LLC

Crossref

Mutation Rate Distribution Inferred from Coincident SNPs and Coincident Substitutions

Author: Blake
Burgess
Clark
Desai
Ewing
Green
Gutenkunst
Hernandez
Hess
Hey
Hobolth
Hodgkinson
Hodgkinson
Hwang
Ines Hellmann
Kent
Kimura
Levy
Lynch
Philip L. F. Johnson
Smit
The 1000 Genomes Project Consortium
The Chimpanzee Sequencing and Analysis Consortium
Walser
Yang
Zhang
Publication venue: Oxford University Press

Mutation rate variation has the potential to bias evolutionary inference, particularly when rates become much higher than the mean. We first confirm prior work that inferred the existence of cryptic, site-specific rate variation on the basis of coincident polymorphisms—sites that are segregating in both humans and chimpanzees. Then we extend this observation to a longer evolutionary timescale by identifying sites of coincident substitutions using four species. From these data, we develop analytic theory to infer the variance and skewness of the distribution of mutation rates. Even excluding CpG dinucleotides, we find a relatively large coefficient of variation and positive skew, which suggests that, although most sites in the genome have mutation rates near the mean, the distribution contains a long right-hand tail with a small number of sites having high mutation rates. At least for primates, these quickly mutating sites are few enough that the infinite sites model in population genetics remains appropriate.

Crossref

PubMed Central

A Geometric Framework for Evaluating Rare Variant Tests of Association

Author: 1000 Genomes Project Consortium
Asimit
Bansal
Basu
Cooper
Dai
Dering
Feng
Gibson
Han
Ionita-Laza
Ladouceur
Li
Li
Li
Lin
Luedtke
Madsen
Mayer-Jochimsen
Morgenthaler
Morris
Neale
Nelson
Pan
Powers
Price
Quintana
Rivas
Sul
Sun
Tennessen
The ENCODE Project Consortium
Tintle
Torgerson
Wu
Yi
Zawistowski
Zhang
Publication venue: Wiley
Publication date: 01/05/2013

The wave of next-generation sequencing data has arrived. However, many questions still remain about how to best analyze sequence data, particularly the contribution of rare genetic variants to human disease. Numerous statistical methods have been proposed to aggregate association signals across multiple rare variant sites in an effort to increase statistical power; however, the precise relation between the tests is often not well understood. We present a geometric representation for rare variant data in which rare allele counts in case and control samples are treated as vectors in Euclidean space. The geometric framework facilitates a rigorous classification of existing rare variant tests into two broad categories: tests for a difference in the lengths of the case and control vectors, and joint tests for a difference in either the lengths or angles of the two vectors. We demonstrate that genetic architecture of a trait, including the number and frequency of risk alleles, directly relates to the behavior of the length and joint tests. Hence, the geometric framework allows prediction of which tests will perform best under different disease models. Furthermore, the structure of the geometric framework immediately suggests additional classes and types of rare variant tests. We consider two general classes of tests which show robustness to noncausal and protective variants. The geometric framework introduces a novel and unique method to assess current rare variant methodology and provides guidelines for both applied and theoretical researchers.
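
The two geometric quantities the abstract above refers to, the lengths of the case and control vectors and the angle between them, can be computed as in the minimal sketch below. It does not implement any of the paper's test statistics; the function names and the toy allele counts are assumptions made for illustration.

```python
import math


def vector_length(counts: list[float]) -> float:
    """Euclidean length of a vector of per-site rare allele counts."""
    return math.sqrt(sum(c * c for c in counts))


def angle_between(a: list[float], b: list[float]) -> float:
    """Angle in radians between two count vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of sites")
    len_a, len_b = vector_length(a), vector_length(b)
    if len_a == 0 or len_b == 0:
        raise ValueError("angle is undefined for a zero vector")
    cosine = sum(x * y for x, y in zip(a, b)) / (len_a * len_b)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cosine)))


# Hypothetical rare allele counts at three variant sites.
case_counts = [3, 0, 4]
control_counts = [0, 5, 0]

case_length = vector_length(case_counts)
control_length = vector_length(control_counts)
angle = angle_between(case_counts, control_counts)
print(case_length, control_length, math.degrees(angle))
```

In this toy input the two vectors have equal length but point in different directions, the situation in which a length-only comparison sees no difference while a comparison that also uses the angle does.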

Crossref

Dordt College

PubMed Central

Deep Blue Documents at the University of Michigan